Conversation

@toillium toillium commented Nov 16, 2025

What does this PR do?

Makes the ClickHouse client thread-safe by disabling autogenerate_session_id.

Relevant docs: https://clickhouse.com/docs/integrations/language-clients/python/driver-api#multi-threaded-applications
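
For reference, a minimal sketch of what the change amounts to with clickhouse-connect (connection parameters here are placeholders; the integration builds its client from the instance config):

```python
import clickhouse_connect
from clickhouse_connect import common

# Per the linked docs on multi-threaded applications: disable automatic
# session id generation *before* creating the client, so queries issued from
# multiple threads don't all carry the same session id.
common.set_setting('autogenerate_session_id', False)

# Placeholder connection parameters.
client = clickhouse_connect.get_client(host='localhost', username='default', password='')
```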

Motivation

#21885

@toillium
Author

cc @sarah-witt @steveny91

codecov bot commented Nov 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.03%. Comparing base (8c6c226) to head (a4c8772).
⚠️ Report is 1 commit behind head on master.


@toillium toillium force-pushed the joe/disable-client-session-ids branch 2 times, most recently from 624bfd2 to f15e379 on November 16, 2025 at 18:34
@toillium toillium force-pushed the joe/disable-client-session-ids branch from f15e379 to 68d1ce7 on November 16, 2025 at 18:35
Contributor

@AAraKKe AAraKKe left a comment


Thanks for the PR @toillium! I am not 100% sure why this would be an issue since, as far as I can tell, the check runs synchronously. Would you mind sharing where you think the multi-threaded behavior is?

The error you report seeing is

HTTPDriver for <URL> received ClickHouse error code 373\n Code: 373. DB::Exception: Session 7de98854-bff0-481b-a959-aa18f3ec9e4d is locked by a concurrent client. (SESSION_IS_LOCKED)

This means that the error is coming from the server returning 373, right? Is it possible that the query launched against ClickHouse is timing out, but we reuse the client and launch a new query against the server, causing the session-already-used error?

I am a bit worried that by not generating the session ID we might be hiding an issue with timeouts, for instance, that we should be surfacing to customers.

My Feedback Legend

Here's a quick guide to the prefixes I use in my comments:

praise: no action needed, just celebrate!
note: just a comment/information, no need to take any action.
question: I need clarification or I'm seeking to understand your approach.
nit: A minor, non-blocking issue (e.g., style, typo). Feel free to ignore.
suggestion: I'm proposing an improvement. This is optional but recommended.
request: A change I believe is necessary before this can be merged.

The only blocking comments are requests; any other type of comment can be applied at the discretion of the developer.

@toillium
Author

Would you mind sharing where do you think the multi-threaded behavior is?

I think the multithreaded behavior is coming from the Datadog agent's query scheduling.

This means that the error is coming from the server returning 373 right? Is it possible that somehow the query launched against clickhouse is timing out somehow but we reuse the client and launch a new query against the server, causing the session already used error?

I think the issue is that query A launches with session id foo, then shortly after, query B launches with session id foo while query A is still running (which implies multithreading is being used). Query B times out while waiting for the lock on the session id, and query A completes some time later.

I am bit worried that by not generating the session ID we might be hiding an issue with timeouts for instance that we should be resurfacing to customers.

I don't think the session id impacts how timeouts are surfaced. My understanding is that it's only used to determine the lifespan of temporary resources on the ClickHouse server (e.g. temporary tables).
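
For illustration, here is a hedged, standalone sketch of that race (not the integration's code; the connection details and the sleep(3) query are placeholders): two threads share one client that still autogenerates a session id, so the query that arrives while the other is still running should be rejected with code 373.

```python
import threading
import clickhouse_connect

# Default behavior: the client generates one session id and attaches it to
# every query it sends.
client = clickhouse_connect.get_client(host='localhost', username='default', password='')

def run_slow_query():
    try:
        # Both threads send the same session id; the one that arrives while the
        # other is still executing should fail with SESSION_IS_LOCKED (373).
        client.query('SELECT sleep(3)')
    except Exception as exc:
        print(f'query failed: {exc}')

threads = [threading.Thread(target=run_slow_query) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```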

@toillium toillium force-pushed the joe/disable-client-session-ids branch from 9fe1a22 to 9cc9c01 on November 24, 2025 at 15:44
@toillium toillium force-pushed the joe/disable-client-session-ids branch from 37e509e to 3b6a98e on November 24, 2025 at 16:03
@joe-clickhouse

Hi all, Joe from clickhouse here (maintainer of clickhouse-connect). Setting autogenerate_session_id=False should indeed resolve the SESSION_IS_LOCKED errors. This makes the client "safe enough" for typical concurrent query usage. Just note that it's not FULLY thread-safe. e.g. avoid calling methods like set_client_setting() or similar concurrently with queries. But for the use case of concurrent queries without modifying client state, you should be good to go.

Hope that helps! Happy to help provide more context or answer any other clickhouse-connect questions as well.
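
As a rough sketch of the "safe enough" pattern described above (settings fixed at construction, no client mutation once threads are running; connection details and queries are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import clickhouse_connect
from clickhouse_connect import common

# Configure everything up front; after this point the client is only used for
# queries and never mutated (no set_client_setting() or similar).
common.set_setting('autogenerate_session_id', False)
client = clickhouse_connect.get_client(host='localhost', username='default', password='')

queries = ['SELECT 1', 'SELECT count() FROM system.tables']  # placeholder queries

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda q: client.query(q).result_rows, queries))
print(results)
```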

@toillium toillium requested a review from AAraKKe November 25, 2025 18:43
@AAraKKe
Contributor

AAraKKe commented Nov 27, 2025

Thanks guys!

I think the multithreaded behavior is coming from datadog agent's query scheduling.

That is my point: the agent itself does not manage any queries; this is done at the integration level, and the integration itself is synchronous. What I am afraid is happening is the following; @joe-clickhouse, maybe you can correct me if this is not how the client behaves.

  • The agent calls the check.
  • The client is created with a given send_receive_timeout and connect_timeout.
  • Query manager sends 20 queries, sequentially.
  • Before the queries finish executing, we hit the client timeout and stop listening, but the queries are still running in ClickHouse.
  • Next time the agent calls the check, since we already have the client defined (with a given session id), we reuse it to launch the same queries again.
  • ClickHouse receives a query with the same session as one that is already running and returns 373.

I am not sure if ClickHouse will cancel ongoing queries if the client times out. I have worked with datastores where a query is neither cancelled nor marked as stale if the client disconnects. In that sense, when we call the datastore we can run into a scenario in which a pending query is blocking the session.

If this is possible with ClickHouse, I am worried that customers have queries timing out that they do not know about. That is what I would try to address.

@joe-clickhouse, do you think this is possible? While removing the session id generation is a fix for the errors and we should get it merged, I would like to understand whether this scenario is possible and whether we should improve our error handling. Also, whether there is any way to ensure ClickHouse cancels an ongoing query; these could start accumulating, generating load on the cluster for queries that no one is going to listen to.

@joe-clickhouse
Copy link

joe-clickhouse commented Dec 1, 2025

Hi @AAraKKe, yes, this is definitely a possible scenario. So if that's the use case, then this is definitely a valid concern.

So, in its current state, if we assume this scenario:

  1. agent calls the check and a client is created with a session_id (default behavior)
  2. client sends queries sequentially
  3. client timeout occurs and the client will stop waiting for a response
  4. queries will continue running on the CH server
  5. next agent check runs and reuses the same client with the same session_id

This will cause new queries to collide with the still-running queries and raise 373 errors. This PR prevents automatic client-side session_id generation, which will fix (or prevent) the 373 errors from occurring.

That said, @AAraKKe, your instinct is correct: if the client times out, the server will not (by default) cancel the query, so query accumulation is a valid concern if you're having client connection/timeout issues and submitting the same queries multiple times.

There are a few things we could do to mitigate these kinds of potential issues, however:

  1. The most reliable approach would probably be setting the server-side setting max_execution_time (see docs) to a value slightly less (maybe 10 seconds or so less) than the client-side setting send_receive_timeout (default of 300s). This should prevent any queries from ever reaching the client timeout (see the sketch after this list). This also looks safe in datadog's QueryExecutor.
  2. This one is less reliable, I think, but on the server we can set cancel_http_readonly_queries_on_client_close=1 (see docs), which tells the server to cancel the query if it detects that the connection has closed. Note that this isn't 100% guaranteed to cancel the query immediately: CH will cancel the query when it detects the connection close, but if it's deep in query execution and not frequently checking for client disconnects, there may be a delay before cancellation actually happens.
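
A hedged sketch of how both suggestions could be applied through clickhouse-connect client settings (timeout values and connection details are placeholders, not what the integration ships):

```python
import clickhouse_connect

SEND_RECEIVE_TIMEOUT = 30  # placeholder client-side timeout in seconds

client = clickhouse_connect.get_client(
    host='localhost',        # placeholder connection details
    username='default',
    password='',
    send_receive_timeout=SEND_RECEIVE_TIMEOUT,
    settings={
        # 1. Keep the server-side limit below the client timeout so the server
        #    kills the query before the client gives up waiting for it.
        'max_execution_time': SEND_RECEIVE_TIMEOUT - 10,
        # 2. Best effort: cancel read-only HTTP queries when the server notices
        #    the client connection has closed.
        'cancel_http_readonly_queries_on_client_close': 1,
    },
)
```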

Side note: I think your scenario is mostly theoretical, since the default client-side timeout is 5 minutes. It'd be pretty rare for correctly designed and working queries to regularly hit that limit, so if for some reason they are, it's worth a separate discussion.

Hope that helps!

@AAraKKe
Contributor

AAraKKe commented Dec 4, 2025

Thanks @joe-clickhouse! I appreciate your help with this; I will follow up with the team and come up with a solution. What I think would be the best approach is:

  • Add a client_timeout metric to the integration so customers know their client is timing out and can reconfigure timeouts or investigate further (see the rough sketch after this list).
  • Update the documentation, following your explanation in your first bullet point, to help them understand what can be done to improve this.
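
For the first bullet, a rough sketch of what emitting such a metric could look like inside the check (hypothetical metric name, attribute names, and exception handling; the concrete timeout exception raised by clickhouse-connect would need to be confirmed):

```python
# Hypothetical sketch, not actual integration code.
def _run_query(self, query):
    try:
        return self._client.query(query)
    except Exception as exc:
        # Assumption: a client-side timeout surfaces here; the exact
        # clickhouse-connect exception class should be checked instead of
        # matching on the message text.
        if 'timeout' in str(exc).lower():
            self.count('clickhouse.client_timeout', 1, tags=self._tags)
        raise
```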

Regarding the default timeout, that is the default on the clickhouse_connect client side, but the integration defaults to 10s (or whatever the customer sets in their integration config), so it is pretty much in the customer's hands whether they land in this scenario. See here.

I am approving the PR to prevent the 373 errors for now, and we will update the integration to handle this appropriately.

Thanks guys!

@AAraKKe AAraKKe added this pull request to the merge queue Dec 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Dec 4, 2025